WIP: pathfinder_compatibility_guard_rails #1977
Introduce CompatibilityGuardRails plus related errors and tests so callers can opt into CTK and driver compatibility checks while reusing the existing pathfinder lookup APIs. Made-with: Cursor
Expose process_wide_compatibility_guard_rails at import time so follow-up changes can route the default cuda.pathfinder APIs through a stable public instance. Document the singleton and pin its public availability with a small regression test. Made-with: Cursor
Make the process-wide CompatibilityGuardRails instance the default path for the public load/find/locate APIs so top-level calls share compatibility state. Factor the routing/fallback/cache-reset glue into a dedicated internal module to keep `cuda.pathfinder.__init__` focused on the public surface, and fall back to the existing raw resolvers when v1 guard rails only have insufficient metadata. Made-with: Cursor
Allow CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS to select strict, best_effort, or off behavior so we can experiment with stricter compatibility checks without changing the public API shape. Made-with: Cursor
Treat driver-packaged libraries as compatibility-neutral so strict mode can load NVML and other driver libs without a raw fallback, while CTK-backed artifacts remain the only items that establish and enforce the process-wide CTK anchor. Made-with: Cursor
Infer the CUDA Toolkit line from both wildcard-pinned and range-based cuda-toolkit requirements so strict process-wide guard rails keep working for editable wheel installs used by nvrtc and nvJitLink. Made-with: Cursor
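To make the two requirement shapes concrete, here is a rough sketch of how a CTK major.minor line could be inferred from a `cuda-toolkit` requirement string. The function name `infer_ctk_line`, the regular expressions, and the rule that a range must span exactly one minor release are assumptions for illustration; the PR's actual inference logic may accept more forms.

```python
import re

# Hypothetical patterns; only the "cuda-toolkit" distribution name comes from the commit message.
_NAME = re.compile(r"^\s*cuda-toolkit(\[[^\]]*\])?\s*(?P<spec>.*)$", re.IGNORECASE)
_WILDCARD = re.compile(r"^==\s*(\d+)\.(\d+)\.\*$")
_LOWER = re.compile(r"^>=\s*(\d+)\.(\d+)(?:\.\d+)*$")
_UPPER = re.compile(r"^<\s*(\d+)\.(\d+)(?:\.0)*$")


def infer_ctk_line(requirement):
    """Return (major, minor) for a cuda-toolkit requirement, or None if unclear."""
    head = requirement.split(";", 1)[0]  # drop any environment marker
    match = _NAME.match(head)
    if match is None:
        return None
    clauses = [c.strip() for c in match.group("spec").split(",") if c.strip()]
    if len(clauses) == 1:
        # Wildcard pin such as "==12.9.*".
        wildcard = _WILDCARD.match(clauses[0])
        return (int(wildcard[1]), int(wildcard[2])) if wildcard else None
    if len(clauses) == 2:
        # Range such as ">=13.0,<13.1": accept only when it spans exactly one minor.
        lower = upper = None
        for clause in clauses:
            lower = lower or _LOWER.match(clause)
            upper = upper or _UPPER.match(clause)
        if lower and upper:
            low = (int(lower[1]), int(lower[2]))
            high = (int(upper[1]), int(upper[2]))
            if high == (low[0], low[1] + 1):
                return low
    return None
```

Returning `None` for anything ambiguous (an open-ended `>=12.0`, for example) mirrors the "insufficient metadata" idea from the earlier commits rather than guessing a toolkit line.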
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
/ok to test
Analysis of CI failures for workflow run Cursor GPT-5.4 Extra High Fast Findings
Why
Proper fix
Introduce a small toolkit-info utility that reads the CUDA_VERSION macro from cuda.h so follow-up guard-rails changes can infer CTK major.minor from toolkit headers without depending on version.json. Made-with: Cursor
Centralize encoded CUDA version parsing and validation so toolkit and driver version helpers stay aligned and cuda.h parsing gets consistent string conversion and error reporting. Made-with: Cursor
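As a rough illustration of the two commits above: `cuda.h` defines `CUDA_VERSION` as an encoded integer (`major * 1000 + minor * 10`, so `12090` means 12.9). The sketch below decodes that integer and extracts it from header text. The function names and error messages are hypothetical and are not the utility added in this PR.

```python
import re

# Matches lines such as "#define CUDA_VERSION 12090" in cuda.h.
_CUDA_VERSION_DEFINE = re.compile(
    r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)\b", re.MULTILINE
)


def decode_cuda_version(encoded):
    """Decode an encoded CUDA version (major * 1000 + minor * 10) into (major, minor)."""
    if not isinstance(encoded, int) or encoded < 0:
        raise ValueError(f"invalid encoded CUDA version: {encoded!r}")
    major, remainder = divmod(encoded, 1000)
    return major, remainder // 10


def parse_cuda_version_from_header_text(text):
    """Find the CUDA_VERSION macro in cuda.h text and return (major, minor)."""
    match = _CUDA_VERSION_DEFINE.search(text)
    if match is None:
        raise ValueError("CUDA_VERSION macro not found in header text")
    return decode_cuda_version(int(match.group(1)))
```

The same `decode_cuda_version` helper would also apply to the integer reported by `cuDriverGetVersion()`, which uses the same encoding; sharing one decoder is what keeps toolkit and driver version handling aligned.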
Replace version.json-based CTK root metadata with cuda.h parsing so compatibility checks use a simpler, more universal toolkit source while preserving wheel-based metadata inference. Made-with: Cursor
/ok to test
At commit c6c38e3, the CI has a single failure in
That failure does not look like a
Spot-checking sibling logs shows that the underlying
So the most important takeaway from the logs is: the single red test is a combination of two conditions happening in the same job:
That explains why this shows up as only one visible failure even though the broader
Issues to look into next:
This keeps the host-backed compatibility checks aligned with the resolved toolkit layout and skips cleanly when cudart headers or cuda.h are unavailable. Made-with: Cursor
This keeps the host-backed compatibility checks aligned with the actual driver state instead of relying on a fixed encoded version in the real-environment tests. Made-with: Cursor
/ok to test
This keeps the real host-backed checks strict when an installed nvcc wheel actually ships nvcc, while avoiding a false failure in cu12 wheel environments that only provide lower-level compiler pieces such as ptxas. Made-with: Cursor
Prompt
I'm coming back to this question as posted under #1977 (comment)
Do we understand this now?
Response
Mostly yes. The
That is exactly what the latest logs show:
So for the current failures on PR 1977, yes, we understand the flip. One nuance remains: the older single anomalous local job from the earlier archive, where a local job still got
/ok to test
Keep the ad hoc conda environment helpers aligned with the current cuda_pathfinder test dependency groups so local verification matches the package's test matrix. Split the PowerShell install list from Linux-only dependencies to avoid pulling unsupported packages on Windows. Made-with: Cursor
Keep diagnostics and tests aligned with the configured default so future rollout changes only need one constant update. Made-with: Cursor
Align CompatibilityGuardRails with the PEP 440 version syntax users already know, and reuse packaging's parser instead of maintaining custom constraint logic. Made-with: Cursor
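For anyone unfamiliar with the PEP 440 syntax mentioned here, this is roughly what reusing `packaging`'s parser looks like. The helper `ctk_satisfies` is a hypothetical name for illustration; how `CompatibilityGuardRails` actually accepts and applies constraints is not shown in this thread.

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version


def ctk_satisfies(ctk_version, constraint):
    """Return True if a CTK version string satisfies a PEP 440 constraint string.

    Hypothetical helper: it only demonstrates packaging's specifier matching.
    """
    return SpecifierSet(constraint).contains(Version(ctk_version))
```

With this, constraints such as `"==12.*"` or `">=12.8,<13"` behave the same way they do in a `pip` requirement, which is the main appeal of not maintaining custom constraint logic.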
Separate item validation, pairwise CTK coherence, and driver checks so later component- and pipeline-aware rules can land without changing current guard-rails behavior. Co-authored-by: Cursor <cursoragent@cursor.com>
Record graph-derived dynamic-link groupings and cross-surface companion tags so later guard-rails milestones can add component- and pipeline-aware policy without reworking the catalogs or resolution plumbing. Co-authored-by: Cursor <cursoragent@cursor.com>
Require exact CTK matching only for authored same-component or companion relationships, so independent artifacts can coexist across minors. Add a Linux-only driver-compatibility override for forward-compatibility deployments without relaxing CTK-coherence checks. Co-authored-by: Cursor <cursoragent@cursor.com>
Query NVML for display-driver release metadata and use it to distinguish backward compatibility from NVIDIA's same-major minor-version compatibility. This lets guard rails follow published driver-branch thresholds instead of treating cuDriverGetVersion() as the whole driver story. Co-authored-by: Cursor <cursoragent@cursor.com>
Track declared nvrtc/nvJitLink producer-consumer flows so guard rails can apply NVIDIA's stricter LTOIR rules without over-constraining PTX, ELF, and CUBIN cases. Keep explicit nvvm pipelines conservative until the model can represent NVVM IR version and dialect details. Co-authored-by: Cursor <cursoragent@cursor.com>
Tracking progress: With commit 6b4d910 we have reached the final milestone as laid out here: These commits were entirely generated with Cursor GPT-5.4 Extra High Fast (I only glanced through them):
/ok to test
Skip Linux-only driver-forward-compatibility tests on non-Linux hosts and stop treating nvcc discovery as mandatory in see_what_works real-host checks. This keeps platform-specific expectations from obscuring real guard-rails regressions when CI infrastructure and host layouts vary. Co-authored-by: Cursor <cursoragent@cursor.com>
/ok to test
Drop redundant mocked happy-path checks that now overlap with the real-host CI matrix, and add explicit ELF/CUBIN pipeline cases so the remaining mocks stay focused on platform, ordering, and version-corner behavior. This keeps the guard-rails suite easier to maintain without giving up the synthetic coverage that real installs still cannot exercise reliably. Co-authored-by: Cursor <cursoragent@cursor.com>
Move public/process-wide and real-host coverage into dedicated modules while centralizing shared fixtures. This keeps the core policy suite focused without changing guard-rails coverage. Co-authored-by: Cursor <cursoragent@cursor.com>
Share the guard-rails-off fixture and small CTK sandbox builders so the touched pathfinder tests stay easier to extend and less error-prone. Co-authored-by: Cursor <cursoragent@cursor.com>
Move static and bitcode caching to the shared locate layer so strict-mode public APIs reuse the same discovery boundary after process-wide guard-rails indirection. Add symmetric wrapper cache clears and a regression test that exercises the strict-mode path. Co-authored-by: Cursor <cursoragent@cursor.com>
/ok to test
* Add nccl_device to _BITCODE_LIBS_PACKAGED_WITH so the guard-rails resolver layer no longer raises KeyError for a name that is already in SUPPORTED_BITCODE_LIBS; lock the dispatch tables in place with parametrized tests that walk every supported bitcode/static/binary name through _resolve_*_item.
* Remove unreachable helpers _pipeline_compatibility_result, _dynamic_lib_pipeline_items, and CompatibilityGuardRails._enforce_declared_dynamic_lib_pipelines_for_pair. The pipeline check still fires from _enforce_declared_dynamic_lib_pipelines_for_item after _remember, which is the only code path that ever produced a result.
* Re-export DriverCtkCompatibilityError from cuda.pathfinder so the driver-vs-CTK case (already advertised by the env-var hint) can be caught by type instead of message text, and list it in api.rst.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ion, and reset naming

* Defer the platform check in CUDA_PATHFINDER_DRIVER_COMPATIBILITY to after the CUDA_PATHFINDER_COMPATIBILITY_GUARD_RAILS=off short-circuit so users who turn guard rails off entirely are no longer forced to also unset the override on non-Linux platforms. The value-validation RuntimeError still fires unconditionally so typos are caught early.
* Move the binary packaged_with mapping next to the binary registry as SUPPORTED_BINARIES_PACKAGED_WITH and reclassify nsys / nsight-sys / ncu / nsight-compute as packaged_with="other" so strict-mode lookups for separately packaged Nsight tools no longer raise misleading "missing CTK metadata" errors.
* Rename CompatibilityGuardRails._reset_for_testing to _reset_state and document that production cache_clear callers also drive it; configured driver overrides are intentionally re-applied while lazily-queried values are dropped.

Co-authored-by: Cursor <cursoragent@cursor.com>
Low-severity polish on the v1 compatibility guard rails surface plus two new tests so the existing invariants are asserted instead of only code-read.

- _owned_distribution_candidates: note that symlinks are intentionally not chased on either side of the path comparison.
- _missing_ctk_metadata_message now appends the conflicting CTK set when wheel metadata for the same on-disk file matches more than one cuda-toolkit distribution, instead of silently collapsing to "could not determine the CTK version".
- _compatible_pair_message picks distinct wording for the same-CTK vs cross-CTK independent-pair cases so the message is no longer misleading when both items share a CTK.
- _declare_dynamic_lib_pipeline gains a docstring explaining why it stays single-underscored in v1 (taxonomy/policy still evolving).
- Block comment near _STATIC_LIBS_PACKAGED_WITH / _BITCODE_LIBS_PACKAGED_WITH calls out the lockstep requirement with SUPPORTED_*_LIBS and points at the parametrized resolver tests that enforce coverage.
- load_nvidia_dynamic_lib augments any CompatibilityCheckError raised during _register_and_check with a sentence explaining the underlying dlopen / LoadLibraryW already happened and the OS handle remains live. Mutates exc.args in place so subclass typing (DriverCtkCompatibilityError) and __cause__ are preserved.
- _try_process_wide_guard_rails_then_fallback documents why the forward-compat hint is appended only on Linux (cuda-compat-* is NVIDIA's Linux-only contract).
- New test_register_and_check_is_idempotent_for_repeated_items asserts duplicate ResolvedItem registrations collapse to one entry.
- New test_driver_ctk_compatibility_error_is_typed_catchable asserts a driver-too-old failure raises DriverCtkCompatibilityError as itself (not just by message), is still a CompatibilityCheckError, and carries the new "OS handle remains live" augmentation.

Co-authored-by: Cursor <cursoragent@cursor.com>
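The in-place `exc.args` mutation mentioned for load_nvidia_dynamic_lib can be sketched as below. The two exception classes are local stand-ins that only mirror the names and subclass relationship described in this thread, and `augment_in_place` is a hypothetical helper, not the PR's code.

```python
class CompatibilityCheckError(RuntimeError):
    """Stand-in for the base guard-rails error."""


class DriverCtkCompatibilityError(CompatibilityCheckError):
    """Stand-in for the driver-vs-CTK subclass."""


def augment_in_place(exc, note):
    """Append a note to the exception message without creating a new exception.

    Because the same object keeps propagating, its concrete type, traceback,
    and __cause__ are all untouched.
    """
    if exc.args and isinstance(exc.args[0], str):
        exc.args = (f"{exc.args[0]} {note}",) + exc.args[1:]
    else:
        exc.args = exc.args + (note,)
    return exc
```

A caller would use it as `except CompatibilityCheckError as exc: augment_in_place(exc, "..."); raise`, so an outer `except DriverCtkCompatibilityError` still matches. Wrapping in a fresh base-class exception instead would lose that typed catch.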
/ok to test
…nt contract

Collapse the duplicated ``nvmlShutdown()`` calls in ``_query_driver_release_version_text`` into a single ``try/finally`` so the cleanup always runs in one place. The asymmetric error-precedence rule is preserved via ``sys.exc_info()[1]``: when both the NVML body and shutdown fail, the body's error wins (Python keeps the shutdown error on ``__context__`` for debugging); when only shutdown fails, the shutdown error surfaces.

Add comments above the matched ``nvmlInit_v2()`` / ``nvmlShutdown()`` pair noting that NVML's init/shutdown is reference-counted, so this balanced pair is safe even when the caller has already initialized NVML elsewhere in the process. Pre-empts a question raised in review on PR NVIDIA#2000.

Add two focused tests filling out the cleanup matrix:

- ``test_query_driver_release_version_text_raises_when_only_shutdown_fails`` asserts a non-zero shutdown status surfaces when the body succeeded.
- ``test_query_driver_release_version_text_body_error_wins_when_both_fail`` locks in the body-error-wins precedence when both calls fail.

Co-authored-by: Cursor <cursoragent@cursor.com>
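The precedence rule described in this commit can be illustrated with a simplified sketch. It uses plain callables returning status codes in place of real NVML bindings, and the names `NvmlCallError` and `query_with_balanced_shutdown` are hypothetical. Unlike the commit's description, this simplified version drops the shutdown error in the both-fail case rather than keeping it on ``__context__``.

```python
import sys


class NvmlCallError(RuntimeError):
    """Stand-in error for a non-zero NVML-style status code."""


def query_with_balanced_shutdown(init, body, shutdown):
    """Run body() between a matched init()/shutdown() pair.

    init and shutdown return 0 on success; body returns the query result
    or raises. Cleanup lives in a single finally block.
    """
    status = init()
    if status != 0:
        raise NvmlCallError(f"init failed with status {status}")
    try:
        return body()
    finally:
        # Inside finally, sys.exc_info()[1] is the in-flight body error,
        # or None when body() returned normally.
        body_error = sys.exc_info()[1]
        status = shutdown()
        if status != 0 and body_error is None:
            # Only shutdown failed: surface the shutdown error.
            raise NvmlCallError(f"shutdown failed with status {status}")
        # Otherwise the body error (if any) keeps propagating and wins.
```

One caveat with this technique: `sys.exc_info()` also reports an exception being handled in an enclosing frame, so calling such a function from inside an `except` block would suppress a shutdown-only failure. Whether the PR's implementation guards against that is not visible from this thread.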
Resolves #1038
Continuation of #1936
WIP — CI testing